An Efficient Filtration Method in Biological Sequence Databases

Author

ANTHONY J.T. LEE

Abstract

Sequence comparison is one of the most important primitive operations in bioinformatics. Roughly speaking, this operation finds which parts of sequences are alike and which parts differ. As a sequence database grows to millions of base pairs, it becomes impractical to search the whole database with sequence alignment methods based on dynamic programming, which has quadratic time complexity. Filtration methods have therefore been proposed to screen out most unrelated data sequences in a preprocessing stage. However, existing filtration methods either incur false negatives or retain too many candidates. In this paper, we propose a filtration method called the Transformation-based Database Filtration method (TDF), which consists of two phases. In the first phase, we divide the data sequences into several blocks, transform each block into a feature vector by the Haar wavelet transform, and build an index over these vectors. In the second phase, we search the index and extract those candidate blocks whose distance to the feature vector of the query sequence is less than a predefined threshold. Finally, for each candidate block, we calculate the edit distance between the corresponding data sequence and the query sequence. Experimental results show that our method prunes a large portion of the database and guarantees no false negatives.
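The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not say how bases are mapped to numbers, how many coefficients are kept, or what index structure is used, so the encoding `CODES`, the block length `BLOCK_LEN`, the coefficient count `k`, the flat list used as an index, and all function names are assumptions made for illustration. In particular, this sketch does not reproduce the no-false-negative guarantee, which depends on a lower-bounding property of the real feature distance.

```python
import math

BLOCK_LEN = 8  # hypothetical block length; must be a power of two
CODES = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}  # hypothetical numeric encoding


def haar_transform(values):
    """Full orthonormal Haar decomposition of a power-of-two-length list."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx, details = list(values), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Coarser detail coefficients go in front of finer ones.
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details


def feature(block, k=4):
    """Feature vector: the first k Haar coefficients of the encoded block."""
    return haar_transform([CODES[ch] for ch in block])[:k]


def build_index(sequences):
    """Phase 1: split sequences into blocks and store (seq_id, feature)."""
    index = []
    for seq_id, seq in enumerate(sequences):
        # Assumption: non-overlapping blocks; a trailing partial block is dropped.
        for start in range(0, len(seq) - BLOCK_LEN + 1, BLOCK_LEN):
            index.append((seq_id, feature(seq[start:start + BLOCK_LEN])))
    return index


def edit_distance(a, b):
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(sequences, index, query, threshold):
    """Phase 2: filter blocks by feature distance, then verify candidates."""
    if len(query) != BLOCK_LEN:
        raise ValueError("this sketch expects a query of exactly BLOCK_LEN symbols")
    q = feature(query)
    candidates = {sid for sid, f in index if math.dist(f, q) < threshold}
    return {sid: edit_distance(sequences[sid], query) for sid in sorted(candidates)}


sequences = ["ACGTACGTTTTTTTTT", "GGGGGGGGCCCCCCCC"]
print(search(sequences, build_index(sequences), "ACGTACGT", threshold=1.0))
```

With this toy data only the first sequence has a block whose feature vector lies within the threshold, so the quadratic edit-distance computation runs once rather than for every sequence. A linear scan over the index stands in for whatever index structure the paper actually builds.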

Similar resources

Protein Databases

Proteins are sources of many peptides with diverse biological activity. Some of them are considered as valuable components of foods and drug targets with desired and designed biological activity. We are now entering an era rich in biological data in which the field of bioinformatics is poised to exploit this information in increasingly powerful ways. There are currently many databases all over ...

Using Transformation Techniques Towards Efficient Filtration of String Proximity Search of Biological Sequences

The problem of proximity search in biological databases is addressed. We study vector transformations and apply DFT (Discrete Fourier Transform) and DWT (Discrete Wavelet Transform, Haar) dimensionality reduction techniques to DNA sequence proximity search to reduce the search time of range queries. Our empirical results on a number of Prokaryote and Eukaryote DNA ...

BFT: A Relational-based Bit Filtration Technique for Efficient Approximate String Joins in Biological Databases

Joining massive tables in relational databases has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to reduce the curse of dimensionality. This paper proposes a novel approach to map the problem of pairwise whole genome comparison into an approximate join operation in the well-established relational database context. We propose a ...

Accelerating Smith-Waterman Alignment for Protein Database Search Using Frequency Distance Filtration Scheme Based on CPU-GPU Collaborative System

The Smith-Waterman (SW) algorithm has been widely used for searching biological sequence databases in bioinformatics. Recently, several works have adopted graphics cards with Graphics Processing Units (GPUs) and their associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on the protein database search by using the intertask parallelization...

Efficient Filtration of Sequence Homology Search through Singular Value Decomposition

Similarity search in textual databases and bioinformatics has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to reduce the curse of dimensionality. This paper proposes a novel approach to map the problem of whole-genome sequence homology search into an approximate vector comparison in the well-established multidimensional vector...

Publication date: 2004
